Alessandro Pomponio - 0000920265

  1. the boxplots of the attributes and a comment on remarkable situations, if any (2pt)
  2. a pairplot of the data (see Seaborn pairplot) and a comment on remarkable situations, if any (2pt)
  3. a clustering schema using a method of your choice exploring a range of parameter values (5pt)
  4. the plot of the global inertia (SSD) and silhouette index for the parameter values you examine (4pt)
  5. the optimal parameters of your choice (4pt)
  6. a pairplot of the data using as hue the cluster assignment with the optimal parameter (3pt)
  7. a plot of the silhouette index for the data points, grouped according to the clusters (4pt)
  8. A sorted list of the discovered clusters for decreasing sizes (7pt)

1. the boxplots of the attributes and a comment on remarkable situations, if any (2pt)

The boxplots show no outliers. Attributes 0 and 3 have very similar distributions, while attributes 1 and 2 share a similar median but differ in how their values are distributed. No other remarkable situation stands out.
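The "no outliers" reading can also be checked numerically. The sketch below applies the 1.5 × IQR whisker rule that boxplots use by default. The `values` array is a placeholder generated for illustration, not the exam data.

```python
import numpy as np

# Placeholder for one attribute of the dataset (assumption, not the real data).
rng = np.random.default_rng(0)
values = rng.uniform(0.0, 1.0, size=200)

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers in a boxplot.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(f"whisker limits: [{lower:.2f}, {upper:.2f}], outliers: {len(outliers)}")
```

Running the same check on each column of the real dataframe should confirm what the boxplots show visually.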

2. a pairplot of the data (see Seaborn pairplot) and a comment on remarkable situations, if any (2pt)

From the pairplot it is clear that columns 1 and 2 tend to form quite distinct clusters. They are probably the most promising attributes for our clustering efforts.

3. a clustering schema using a method of your choice exploring a range of parameter values (5pt)

To find a clustering scheme, we will use K-means, varying the number of clusters from 2 to 10 and applying the elbow method to the results.
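A minimal sketch of this parameter sweep, assuming scikit-learn is available. The data here comes from `make_blobs` as a synthetic stand-in for the exam dataset, and the values of `n_init` and `random_state` are illustrative choices rather than the ones used in the original notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the exam data (assumption): 4 features, 4 blobs.
X, _ = make_blobs(n_samples=400, centers=4, n_features=4,
                  cluster_std=0.5, random_state=42)

k_range = range(2, 11)  # 2 to 10 clusters, inclusive
inertias, silhouettes = [], []
for k in k_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X)
    inertias.append(km.inertia_)                     # global inertia (SSD)
    silhouettes.append(silhouette_score(X, labels))  # mean silhouette index

# Candidate for the best k: the one with the highest mean silhouette.
best_k = k_range[int(np.argmax(silhouettes))]
print("best k by silhouette:", best_k)
```

The two lists can then be plotted against `k_range` to look for the inertia elbow and the silhouette peak.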

4. the plot of the global inertia (SSD) and silhouette index for the parameter values you examine (4pt)

5. the optimal parameters of your choice (4pt)

Both the silhouette scores and the elbow of the inertia curve suggest that the best number of clusters is 4, which is in line with what we expected from the initial pairplots.

6. a pairplot of the data using as hue the cluster assignment with the optimal parameter (3pt)

To use the predicted labels as hue, we add them as a column of a new dataframe using the assign method.
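As a small sketch of this step, assuming pandas: the frame and the labels below are placeholders, and the column name `cluster` is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Placeholder data with four numeric columns named 0..3 (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(6, 4)))
labels = np.array([0, 1, 1, 2, 0, 2])  # placeholder cluster assignments

# assign returns a new DataFrame; the original df is left unchanged.
df_labeled = df.assign(cluster=labels)
print(df_labeled.head())
```

The resulting frame can then be passed to Seaborn's pairplot with `hue="cluster"`, so that each point is coloured by its cluster.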

7. a plot of the silhouette index for the data points, grouped according to the clusters (4pt)

To perform this task, we will use the plot_silhouette function that was introduced in the class exercises.
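The plot_silhouette helper from class is not reproduced here. As a rough sketch of the data such a plot is built on, the per-point silhouette values can be computed with scikit-learn's `silhouette_samples` and grouped by cluster. The data is again a synthetic `make_blobs` stand-in, and the variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Synthetic stand-in for the exam data (assumption).
X, _ = make_blobs(n_samples=200, centers=4, n_features=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# One silhouette value per data point.
sample_values = silhouette_samples(X, labels)

# Group the values by cluster and sort each group: a silhouette plot
# typically draws each group as one band of sorted horizontal bars.
per_cluster = {
    int(c): np.sort(sample_values[labels == c])
    for c in np.unique(labels)
}
for c, vals in per_cluster.items():
    print(f"cluster {c}: {len(vals)} points, mean silhouette {vals.mean():.3f}")
```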

8. A sorted list of the discovered clusters for decreasing sizes (7pt)

To make this task easier, we use NumPy's bincount function.

bincount returns an array whose indices are the cluster numbers and whose values are the number of elements in each cluster. We can then build tuples to make this association explicit.

We can now sort the tuples by decreasing size and extract the cluster indices to obtain what was requested.
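The three steps above (count, pair, sort) can be sketched as follows. The `labels` array is a hypothetical example, not the output of the actual clustering.

```python
import numpy as np

# Hypothetical cluster labels for ten points (assumption).
labels = np.array([2, 0, 1, 2, 2, 3, 1, 2, 3, 3])

# sizes[i] is the number of points assigned to cluster i.
sizes = np.bincount(labels)

# Make the (cluster, size) association explicit.
pairs = [(cluster, int(size)) for cluster, size in enumerate(sizes)]

# Sort by size, largest first; ties keep their original order.
pairs_sorted = sorted(pairs, key=lambda pair: pair[1], reverse=True)
sorted_clusters = [cluster for cluster, _ in pairs_sorted]
print(sorted_clusters)  # [2, 3, 1, 0]
```

Keeping the size in each tuple also makes it easy to print the cluster sizes alongside the indices if needed.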